Skip to content

fix: correct judge prompt construction in MM-MT-Bench#14

Open
abdelhadi703 wants to merge 1 commit into
mistralai:mainfrom
abdelhadi703:fix/mm-mt-bench-judge-prompt
Open

fix: correct judge prompt construction in MM-MT-Bench#14
abdelhadi703 wants to merge 1 commit into
mistralai:mainfrom
abdelhadi703:fix/mm-mt-bench-judge-prompt

Conversation

@abdelhadi703

Copy link
Copy Markdown

Summary

  • Fix message content extraction in judge prompt construction (iterating over dict keys instead of values)
  • Fix reference answer extraction (passing full dict instead of text content)
  • Add image support in judge prompts (convert PIL images to base64 for the judge model)
  • Fix _add_or_append_chunk() to actually append image chunks to the prompt
  • Fix operator precedence in health check condition (models.py)

Fixes #8

Changes

  • eval/tasks/mm_mt_bench.py: Extract .content from message dicts, extract text from reference answer chunks, convert PIL images to base64 image_url chunks, fix image chunk appending
  • eval/models.py: Add parentheses to fix operator precedence in _wait_till_healthy()

Test plan

  • Run MM-MT-Bench evaluation and verify judge prompts are correctly formatted
  • Verify health check logic works with both empty body and JSON status responses

🤖 Generated with Claude Code

- Extract message content from dicts in get_judgement() instead of
  passing full message dicts to the judge prompt builder
- Add _extract_text_content() to extract plain text from reference
  answers that may be lists of typed chunks
- Add _convert_image_chunks() to convert PIL image objects to base64
  image_url chunks for the judge model
- Fix _add_or_append_chunk() to actually append image chunks to the
  prompt instead of returning them
- Fix operator precedence in _wait_till_healthy() health check condition

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abdelhadi703

Copy link
Copy Markdown
Author

Hi @mistralai/team,

Bumping this fix for MM-MT-Bench judge prompt construction. It ensures correct prompt formatting for evaluation.

Happy to address any review comments. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Problem with judge prompt for MM-MT-Bench

1 participant